The data have been downloaded from the website, and basic cleaning was performed in the preparation step. Because the dataset is large (2.5 GB for three months of data), a sample is used to explore the data more efficiently.
The preparation of the data includes:
Load the libraries
The sample data is read after the preparation step, which includes basic cleaning of the data.
data <- read.csv("data/taxi_sample_data.csv")
Besides the trip records, the website provides a .csv with the taxi zones in New York:
zones <- read.csv("data/taxi_zones.csv")
plot(spZones["zone"])
From the .shp file we also have information about boroughs:
borough <- ggplot() + geom_sf(data=spZones, aes(fill = borough))
borough
There is also information about the kind of service area in the .csv
a <- merge(spZones, zones[,c(1,4)], by='LocationID')
service <- ggplot() + geom_sf(data=a, aes(fill = service_zone))
service
summary(data)
## VendorID tpep_pickup_datetime tpep_dropoff_datetime
## Min. :1.000 2017-06-28 20:31:05: 3 2017-03-18 03:32:29: 3
## 1st Qu.:1.000 2017-11-27 20:08:58: 3 2017-06-05 19:30:40: 3
## Median :2.000 2017-03-01 18:04:24: 2 2017-03-01 18:11:50: 2
## Mean :1.549 2017-03-01 19:07:32: 2 2017-03-01 19:20:12: 2
## 3rd Qu.:2.000 2017-03-01 19:38:59: 2 2017-03-01 19:33:17: 2
## Max. :2.000 2017-03-01 21:44:09: 2 2017-03-01 20:13:47: 2
## (Other) :43751 (Other) :43751
## passenger_count trip_distance RatecodeID store_and_fwd_flag
## Min. :1.000 Min. : 0.510 Min. :1.000 N:43598
## 1st Qu.:1.000 1st Qu.: 1.200 1st Qu.:1.000 Y: 167
## Median :1.000 Median : 1.900 Median :1.000
## Mean :1.619 Mean : 2.942 Mean :1.002
## 3rd Qu.:2.000 3rd Qu.: 3.500 3rd Qu.:1.000
## Max. :6.000 Max. :52.200 Max. :4.000
##
## PULocationID DOLocationID payment_type fare_amount
## Min. : 4.0 Min. : 3.0 Min. :1 Min. : 2.50
## 1st Qu.:113.0 1st Qu.:100.0 1st Qu.:1 1st Qu.: 7.00
## Median :161.0 Median :161.0 Median :1 Median : 10.00
## Mean :161.5 Mean :158.4 Mean :1 Mean : 12.71
## 3rd Qu.:231.0 3rd Qu.:234.0 3rd Qu.:1 3rd Qu.: 15.00
## Max. :265.0 Max. :265.0 Max. :1 Max. :245.00
##
## extra mta_tax tip_amount tolls_amount
## Min. :0.5000 Min. :0.5 Min. : 0.010 Min. : 0.0000
## 1st Qu.:0.5000 1st Qu.:0.5 1st Qu.: 1.550 1st Qu.: 0.0000
## Median :0.5000 Median :0.5 Median : 2.150 Median : 0.0000
## Mean :0.6649 Mean :0.5 Mean : 2.765 Mean : 0.2123
## 3rd Qu.:1.0000 3rd Qu.:0.5 3rd Qu.: 3.200 3rd Qu.: 0.0000
## Max. :1.0000 Max. :0.5 Max. :425.470 Max. :37.7600
##
## improvement_surcharge total_amount
## Min. :0.3 Min. : 5.31
## 1st Qu.:0.3 1st Qu.: 10.30
## Median :0.3 Median : 13.56
## Mean :0.3 Mean : 17.16
## 3rd Qu.:0.3 3rd Qu.: 19.80
## Max. :0.3 Max. :445.27
##
The main feature seen in the summary concerns the numeric variables, whose maximum values are much higher than the mean or the median: total amount, tip amount, tolls, fare amount, etc. This points to outliers or skewed distributions in the data.
We will now explore the numeric variables and look for the main characteristics and data distributions.
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,15,17)]
df <- melt(df)
## No id variables; using all as measure variables
# Theme characteristics for plots.
myTheme <- custom.theme(pch=20, cex=0.5)
myTheme$strip.background$col <- "white"
# Frequency of data in each variable
histogram(~value|variable, data=df, par.settings=myTheme, xlab=" ", scales='free')
The boxplots also help to see the data distribution. Some records present very high values for the numeric variables.
ggplot(df, aes(y=value)) +
geom_boxplot(aes(fill=value),outlier.size = 0.1) +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
We add new conditions on the numeric variables, since the total, fare and tip amounts have extremely large values.
We also find records where the tip is higher than the fare. Additionally, the tip should be at least 10% of the fare, and we only consider tips below 30% of the fare: since the idea is to recommend a tip depending on the other variables, a tip above 30% cannot reasonably be recommended.
Remove records with fares lower than tips, and tips below 10% or above 30% of the fare:
data <- data[data$fare_amount > data$tip_amount,]
data <- data[data$tip_amount > 0.1*data$fare_amount,]
data <- data[data$tip_amount < 0.3*data$fare_amount,]
histogram(data$tip_amount, par.settings=myTheme, xlab="tip amount")
What does this look like as a percentage of the fare?
histogram((data$tip_amount/data$fare_amount)*100, par.settings=myTheme, xlab="tip %")
Most of the tips are between 20% and 25% of the fare amount.
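The filtering rules above can be checked on a toy data frame (the numbers below are made up for the sketch):

```r
# Toy check of the tip rules: keep rows where tip < fare and 0.1*fare < tip < 0.3*fare
toy <- data.frame(fare_amount = c(10, 10, 10, 5),
                  tip_amount  = c(2, 0.5, 4, 6))
keep <- toy$tip_amount < toy$fare_amount &
        toy$tip_amount > 0.1 * toy$fare_amount &
        toy$tip_amount < 0.3 * toy$fare_amount
toy <- toy[keep, ]                                   # only the (10, 2) row survives
toy$tipPer <- toy$tip_amount / toy$fare_amount * 100 # tip as % of the fare
```

Only the first row passes all three conditions: a tip of 0.5 on a fare of 10 is below 10%, a tip of 4 is above 30%, and a tip of 6 exceeds the fare of 5.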
We can plot the rest of the variables together:
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,17)]
df <- melt(df)
# Theme characteristics for plots.
myTheme <- custom.theme(pch=20, cex=0.5)
myTheme$strip.background$col <- "white"
# Frequency of data in each variable
histogram(~value|variable, data=df, par.settings=myTheme, xlab=" ",scales='free')
All the records are concentrated around small values.
Another numeric variable that could be relevant is the duration of each trip, which we can calculate from the pick-up and drop-off times.
The timestamps are loaded as factors. Convert them to date-time objects:
data$tpep_pickup_datetime <- as.POSIXct(as.character(data$tpep_pickup_datetime), tz = "GMT", format="%Y-%m-%d %H:%M:%S")
data$tpep_dropoff_datetime <- as.POSIXct(as.character(data$tpep_dropoff_datetime), tz = "GMT",format="%Y-%m-%d %H:%M:%S")
## transform to NY time zone
data$tpep_pickup_datetime <- with_tz(data$tpep_pickup_datetime, tzone = "America/New_York")
data$tpep_dropoff_datetime <- with_tz(data$tpep_dropoff_datetime, tzone = "America/New_York")
## new column with the trip duration in minutes
## (vectorized difftime with explicit units, instead of a slow row-wise apply)
data$trip_timelength <- as.numeric(difftime(data$tpep_dropoff_datetime, data$tpep_pickup_datetime, units = "mins"))
# Plot the new variable
histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')
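One detail worth pinning down when building the duration column: `difftime()` defaults to `units = "auto"`, which can pick seconds for short gaps and minutes or hours for longer ones, so an explicit unit keeps the new column on a single scale. A minimal illustration with made-up timestamps:

```r
# difftime with explicit units: the result is always in minutes,
# regardless of how long the gap between the two timestamps is
t1 <- as.POSIXct("2017-03-01 18:04:24", tz = "America/New_York")
t2 <- as.POSIXct("2017-03-01 18:11:50", tz = "America/New_York")
mins <- as.numeric(difftime(t2, t1, units = "mins"))  # 7 min 26 s as minutes
```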
The data have negative values that should be removed; we keep only trips longer than 5 minutes:
data <- data[data[,"trip_timelength" ] > 5,]
histogram(data$trip_timelength, par.settings=myTheme, xlab="trip time duration", scales='free')
We can now plot all the numeric variables together again, and we will see that the largest values of the fare amount have been removed along with the negative trip durations. The density functions help show the shape of the data distributions.
# extract from the dataset the numeric variables
df <- data[, c(5,11,14,17,18)]
df$tipPer <- df[,3]/df[,2]*100
df <- melt(df)
## No id variables; using all as measure variables
# Theme characteristics for plots.
myTheme <- custom.theme(pch=20, cex=0.5)
myTheme$strip.background$col <- "white"
# Frequency of data in each variable
ggplot(df, aes(y=value)) +
geom_boxplot(aes(fill=value),outlier.size = 0.1) +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
We can have a look at the scatterplots of the numerical variables: a linear relationship is seen, but with some stratified structure. If we look at the tip amount at the center of the scatterplot matrix, we can see some constant values of 'tip amount' independent of the fare amount.
splom(data[,c(5,11,14,17,18)], cex=0.3,alpha=0.5,par.settings=myTheme, auto.key=TRUE)
Another variable that could be relevant is the hour at which the taxi is picked up or dropped off. Create new columns with the hour of the pick-up and drop-off:
data$t_pu <- hour(data$tpep_pickup_datetime)
data$t_do <- hour(data$tpep_dropoff_datetime)
Plot the frequency of records by hour of pick-up:
df <- data[, c("tip_amount","t_pu", "t_do")]
barchart(table(data$t_pu), par.settings=myTheme, xlab=" ", horizontal=FALSE)
It can be seen that there are far fewer records between 2 and 9 a.m. than during the rest of the day: fewer taxi trips are made overnight. We check whether the tip differs depending on the pick-up hour:
df <- df %>%
group_by(t_pu) %>%
summarise(m = median(tip_amount))%>%
as.data.frame()
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median tip', xlab='time', par.settings=myTheme)
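The group-and-summarise pattern used here can be illustrated on a toy data frame (column names below are made up for the sketch):

```r
library(dplyr)

# Median tip per pick-up hour on toy data: group rows by hour,
# then collapse each group to its median tip
toy <- data.frame(hour = c(9, 9, 18, 18, 18),
                  tip  = c(2, 4, 1, 3, 5))
med <- toy %>%
  group_by(hour) %>%
  summarise(m = median(tip)) %>%
  as.data.frame()   # one row per hour, with the group median in `m`
```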
Between 0 and 8 in the morning the median tip amount behaves differently, but we should be careful: this is the period with the fewest records, so the conclusions might not be representative.
We can ask whether the time of pick-up influences that amount. The median tip % is represented for each hour. Again, it is difficult to draw a conclusion due to the lack of data between 0 and 9 hours, but the highest value is found at 9 a.m., with a trend towards higher values between 11 and 12.
df <- data[ ,c("t_pu", "t_do")]
df$tipPer <- data[,14]/data[,11]*100
df <- df %>%
group_by(t_pu) %>%
summarise(m = median(tipPer))%>%
as.data.frame()
dotplot(m~as.factor(t_pu), data=df, pch=20, grid=TRUE, cex=2, ylab='median % tip', xlab='time', par.settings=myTheme)
After examining the numeric variables of the dataset, we turn to its categorical variables.
which(data$payment_type != 1) # records of payments different than 1
## integer(0)
histogram(data$RatecodeID, par.settings=myTheme)
which(data$RatecodeID != 1) # records with rate code different from 1
## integer(0)
The zones .csv gives, for each LocationID, the borough and the service zone. We create two columns in the data frame with the PU (pick-up) and DO (drop-off) service zone and borough.
LocationID takes 265 values, but the zones table only has information for 263 of them; in addition, the shapefile has no information on these zones. We therefore remove those records before continuing:
data <- data[data$PULocationID <= 263,]
data <- data[data$DOLocationID <= 263,]
tail(zones)
## LocationID Borough Zone service_zone
## 260 260 Queens Woodside Boro Zone
## 261 261 Manhattan World Trade Center Yellow Zone
## 262 262 Manhattan Yorkville East Yellow Zone
## 263 263 Manhattan Yorkville West Yellow Zone
## 264 264 Unknown NV N/A
## 265 265 Unknown <NA> N/A
zones <- zones[1:263,]
Create 2 variables for the PU and DO service zone
data$PU_servicezone <- data$PULocationID
data$DO_servicezone <- data$DOLocationID
## replace the ID code with the zone in the zone dataset
repl <- function(x) {
  for (i in 1:263) {
    x[which(x["PU_servicezone"] == i), "PU_servicezone"] <- as.character(zones[i, "service_zone"])
    x[which(x["DO_servicezone"] == i), "DO_servicezone"] <- as.character(zones[i, "service_zone"])
  }
  x
}
data <- repl(data)
data$PU_servicezone <- factor(data$PU_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))
data$DO_servicezone <- factor(data$DO_servicezone, levels=c("Airports","Boro Zone", "EWR", "Yellow Zone"))
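The per-ID replacement loop can also be written as a single vectorized lookup with `match()`; a sketch on a toy lookup table (names and values are made up):

```r
# match() finds, for each trip's LocationID, the matching row in the
# lookup table; indexing service_zone with that gives the zone per trip
lookup <- data.frame(LocationID   = 1:3,
                     service_zone = c("Yellow Zone", "Boro Zone", "Airports"),
                     stringsAsFactors = FALSE)
trips <- data.frame(PULocationID = c(2, 1, 3, 2))
trips$PU_servicezone <- lookup$service_zone[match(trips$PULocationID, lookup$LocationID)]
```

This avoids the 263-iteration loop entirely and scales to any number of rows.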
Create 2 variables for the PU and DO borough:
data$PU_boroughzone <- data$PULocationID
data$DO_boroughzone <- data$DOLocationID
## replace the ID code with the zone in the zone dataset
repl <- function(x) {
  for (i in 1:263) {
    x[which(x["PU_boroughzone"] == i), "PU_boroughzone"] <- as.character(zones[i, "Borough"])
    x[which(x["DO_boroughzone"] == i), "DO_boroughzone"] <- as.character(zones[i, "Borough"])
  }
  x
}
data <- repl(data)
data$PU_boroughzone <- factor(data$PU_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))
data$DO_boroughzone <- factor(data$DO_boroughzone, levels=c("Bronx", "Brooklyn", "EWR", "Manhattan", "Queens", "Staten Island"))
We can now plot the frequency of trips depending on their pick-up and drop-off areas. The next figure shows that most of the trips are picked up in Manhattan and finish in the same area. Few trips that started in Manhattan finish in Brooklyn or Queens.
df <- data[,c(23,24)]
barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')
It can be observed that most of the trips start in Manhattan and end there as well. A smaller number of trips that start in Manhattan go to Brooklyn or Queens. A few records show trips starting in Queens or Brooklyn and ending, in similar proportions, in Manhattan, Brooklyn or Queens. This means that most of the flows are within, or towards, the borough of Manhattan.
The same can be done by service zone, which gives us the specific location of the airports, which could be interesting.
df <- data[,c(21,22)]
barchart(table(df), auto.key=TRUE, par.settings=myTheme,xlab="times", ylab='pick up')
From that graph we see that half of the trips from airports end in the 'Yellow Zone', which is mostly the Manhattan area, as seen in the figures at the beginning of the report.
We can see these variables on a map, taking advantage of the higher resolution of the PU and DO location IDs.
Number of times a taxi is picked up in each zone:
PU <- data %>%
group_by(PULocationID) %>%
summarise(no_rows = length(PULocationID)) %>%
as.data.frame()
names(PU) <- c("LocationID","PU_times")
b <- merge(spZones, PU, by='LocationID')
## logarithmic scale because of the large difference between airports (maximum) and the rest
pu <- ggplot() + geom_sf(data=b, aes(fill = PU_times))+scale_fill_continuous(trans = "log10", type='viridis')
pu
Number of times a taxi is dropped off in each zone:
DO <- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = length(DOLocationID))%>%
as.data.frame()
names(DO) <- c("LocationID","DO_times")
c <- merge(spZones, DO, by='LocationID')
do <- ggplot() + geom_sf(data=c, aes(fill = DO_times))+scale_fill_continuous(trans = "log10",type='viridis')
do
Since this is the yellow-taxi dataset, most of the trips are in the Manhattan area.
We can create a new categorical variable related to the time of day at which the trip is made. From the 'hour' variable we can label each trip as 'daytime' or night.
data$t_puH <- data$t_pu
data$t_doH <- data$t_do
data[which(data[ ,"t_puH"] >= 7 & data[ ,"t_puH"] <= 20), 't_puH'] <- "day"
data[which(data[ ,"t_puH"] != "day"), 't_puH'] <- "night"
data[which(data[ ,"t_doH"] >= 7 & data[ ,"t_doH"] <= 20), 't_doH'] <- "day"
data[which(data[ ,"t_doH"] != "day"), 't_doH'] <- "night"
data$t_puH <- as.factor(data$t_puH)
data$t_doH <- as.factor(data$t_doH)
histogram(data$t_puH, par.settings=myTheme)
Most of the trips are made during the day time.
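The two-step day/night recode above can also be written as a single vectorized `ifelse()` call; a toy sketch with made-up hours:

```r
# Hours 7 to 20 are labelled "day", everything else "night"
hours  <- c(3, 7, 14, 20, 21)
period <- factor(ifelse(hours >= 7 & hours <= 20, "day", "night"))
```

This avoids the intermediate mixed numeric/character column and produces the factor directly.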
We now also consider the day of the week as a categorical variable, to see whether people take taxis more on certain days, or whether it influences the fares, tips, etc.
data$weekday <- weekdays(data$tpep_pickup_datetime)
## weekday names come from the system locale (Spanish here), ordered Sunday to Saturday
data$weekday <- factor(data$weekday, levels=c("domingo","lunes","martes", "miércoles","jueves","viernes","sábado"))
# plot
barchart(as.factor(data$weekday), par.settings=myTheme, horizontal=FALSE,xlab='day of the week')
Although most days have a similar number of records, it is clear on which day there are fewest taxi trips in New York City.
We create another categorical variable describing whether the trip crosses borough boundaries: trips whose pick-up and drop-off boroughs differ are labelled 'yes' in the 'intratrip' variable.
data$intratrip <- seq(1:nrow(data))
data[which(data$PU_boroughzone != data$DO_boroughzone), "intratrip"] <- "yes"
data[which(data$PU_boroughzone == data$DO_boroughzone), "intratrip"] <- "no"
histogram(as.factor(data$intratrip), par.settings=myTheme, xlab='intra-trip')
First, we represent the numeric variables depending on some categorical variables, such as the pick-up or drop-off area or the rate code at the end of the trip.
## all together:
tip <- melt(data[,c(14,23,24)], 'tip_amount')
trip <- melt(data[,c(5,23,24)], 'trip_distance')
fare <- melt(data[,c(11,23,24)], 'fare_amount')
total <- melt(data[,c(17,23,24)], 'total_amount')
trip_time <- melt(data[,c(18,23,24)], 'trip_timelength')
all <- cbind(trip, tip, total, fare,trip_time)
all <- all[,c(1,4,7,10,13,14,15)]
b <- melt(all)
## Using variable, value as id variables
names(b) <- c("pu_do","borough","variable","value")
b$pu_do <- as.character(b$pu_do)
b[(b$pu_do == "PU_boroughzone"), "pu_do"] <- 'PU'
b[(b$pu_do == "DO_boroughzone"), "pu_do"] <- 'DO'
ggplot(b, aes(x=pu_do,y=value)) +
geom_boxplot(aes(fill=borough),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free', ncol=2)
all <- data[,c(5,11,14,17,18,27)]
all <- melt(all, 'weekday')
ggplot(all, aes(x=weekday,y=value)) +
geom_boxplot(aes(fill=weekday),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free', ncol=2)
trip <- melt(data[,c(5,28)], 'trip_distance')
tip <- melt(data[,c(14,28)], 'tip_amount')
fare <- melt(data[,c(11,28)], 'fare_amount')
total <- melt(data[,c(17,28)], 'total_amount')
trip_time <- melt(data[,c(18,28)], 'trip_timelength')
all <- cbind(trip, tip, total, fare, trip_time)
all <- all[,c(1,4,7,10,13,14,15)]
b <- melt(all)[,2:4]
## Using variable, value as id variables
names(b) <- c("intratrip","variable","value")
ggplot(b, aes(x=intratrip,y=value)) +
geom_boxplot(aes(fill=intratrip),outlier.size = 0.1) + scale_fill_brewer(palette="Paired") +
labs(x=" ", y=" ") +
facet_wrap(~ variable, scales='free')
There seems to be a clear increase in all the numeric variables when the taxi trip goes from one borough to another. However, if we take a closer look at the tip variable in percentage terms:
## tip percentage
tip <- data[,c(11,14,28)]
tip$tipPer <- tip[,2]/tip[,1]*100
ggplot(tip, aes(x=intratrip,y=tipPer)) +
labs(x=" ", y="tip % ")+
geom_boxplot(aes(fill=intratrip),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired")
The median tip % is lower for trips between boroughs, although, as seen in the previous figure, their tip amount is larger.
We can also evaluate whether the tip % changes with the time at which the taxi is picked up.
## tip percentage
tip <- data[,c(11,14,25)]
tip$tipPer <- tip[,2]/tip[,1]*100
ggplot(tip, aes(x=t_puH,y=tipPer)) +
labs(x=" ", y="tip % ")+
geom_boxplot(aes(fill=t_puH),outlier.size = 0.1)+ scale_fill_brewer(palette="Paired")
splom(~data[,c(5,11,14,17,18)]|as.factor(data$PU_servicezone), par.settings=myTheme, cex=0.1, groups=data$t_puH, auto.key=TRUE, data=data,varnames=c("distance","fare","tip","total","duration"))
The main differences are found in the trip duration: for the same distance, daytime trips take longer than night-time trips. We also observe that most of the airport data is collected during the day, and that the same distance clearly takes less time during night hours. This is probably due to rush-hour traffic.
data$tipPer <-data[,14]/data[,11]*100
PU_tip<- data %>%
group_by(PULocationID) %>%
summarise(no_rows = mean(tipPer)) %>%
as.data.frame()
names(PU_tip) <- c("LocationID","PU_mean")
DO_tip<- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = mean(tipPer)) %>%
as.data.frame()
names(DO_tip) <- c("LocationID","DO_mean")
Plot the mean tip % by zone:
d <- merge(spZones, PU_tip, by='LocationID')
pu_tip<- ggplot() + geom_sf(data=d, aes(fill = PU_mean))+scale_fill_continuous(type='viridis')
e <- merge(spZones, DO_tip, by='LocationID')
do_tip<- ggplot() + geom_sf(data=e, aes(fill = DO_mean))+scale_fill_continuous(type='viridis')
pu_tip
do_tip
Mean trip length by zone:
PU_length <- data %>%
group_by(PULocationID) %>%
summarise(no_rows = mean(trip_distance))%>%
as.data.frame()
names(PU_length) <- c("LocationID","PU_length")
DO_length <- data %>%
group_by(DOLocationID) %>%
summarise(no_rows = mean(trip_distance))%>%
as.data.frame()
names(DO_length) <- c("LocationID","DO_length")
Plot the mean trip length by zone:
f <- merge(spZones, PU_length, by='LocationID')
pu_length <- ggplot() + geom_sf(data=f, aes(fill = PU_length))+scale_fill_continuous(type='viridis')
g <- merge(spZones, DO_length, by='LocationID')
do_length <- ggplot() + geom_sf(data=g, aes(fill = DO_length))+scale_fill_continuous(type='viridis')
pu_length
do_length
Some insights can be derived from the taxi dataset:
About the cleaning process:
In general:
The data is downloaded from the website. As in the exploratory analysis, a sample of the data is used due to computational limitations.
After reading the data, it is cleaned taking into account the findings of the exploratory data analysis, and some new variables are included:
In the cleaning phase:
The other variables included are:

* Duration of the trip
* The day of the week
* The 'intratrip' variable
The prepared data is stored as a .csv called taxi_model_sample_data.csv. The script where this is done is model_sample.R.
The dataset is then split in two: 80% of the data is used for training and the remaining 20% to test the model.
data <- read.csv("data/taxi_model_sample_data.csv")
#########################
## sample the data for modelling:
# Random sample indexes
train_index <- sample(1:nrow(data), 0.8 * nrow(data))
test_index <- setdiff(1:nrow(data), train_index)
# Build X_train, y_train, X_test, y_test
X_train <- data[train_index, -3]
y_train <- data[train_index, "tip_amount"]
X_test <- data[test_index, -3]
y_test <- data[test_index, "tip_amount"]
## save these datasets:
write.csv(X_train, "data/xtrain.csv", row.names=FALSE)
write.csv(y_train, "data/ytrain.csv",row.names=FALSE)
write.csv(X_test, "data/xtest.csv",row.names=FALSE)
write.csv(y_test, "data/ytest.csv",row.names=FALSE)
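One caveat about the split above: `sample()` draws different indices on every run. Fixing the RNG seed makes the 80/20 split reproducible; a minimal sketch (the seed value 42 and n = 100 are arbitrary):

```r
set.seed(42)  # fix the RNG so the same rows land in train/test each run
n <- 100
train_index <- sample(1:n, 0.8 * n)        # 80% of the row indices
test_index  <- setdiff(1:n, train_index)   # the remaining 20%
```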
Code for the model run can be found in ‘model_run.R’, but I will reproduce it here.
## Built and run the model
## As seen in the EDA, there is a linear relationship among the selected numeric variables. Some of them are linearly dependent on others, which should show up in our model.
library(MASS)
##
## Attaching package: 'MASS'
## The following objects are masked from 'package:raster':
##
## area, select
## The following object is masked from 'package:dplyr':
##
## select
## Load the train data:
xtrain <- read.csv("data/xtrain.csv")
ytrain <- read.csv("data/ytrain.csv")
train <- cbind(ytrain, xtrain)
names(train) <- c("tip_amount",names(xtrain))
We first test a multivariate linear regression model. The numerical variables included will be the fare amount, the trip duration and the trip distance. Additionally, we include two categorical variables: the day of the week and whether or not the trip stays within the same borough (the 'intratrip' variable).
Run the model
## 1.1 model: use all the variables except the total amount that depends on the dependent variable 'tip'.
train <- train[, -4]
lm <- lm(tip_amount~. ,data=train)
summary(lm)
##
## Call:
## lm(formula = tip_amount ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6836 -0.1019 0.1087 0.1904 5.3176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.127109 0.011419 11.131 < 2e-16 ***
## trip_distance -0.006633 0.006066 -1.093 0.27421
## fare_amount 0.197551 0.003091 63.918 < 2e-16 ***
## trip_timelength 0.002625 0.001138 2.308 0.02101 *
## intratripyes 0.212188 0.007636 27.788 < 2e-16 ***
## weekdayjueves 0.031655 0.010106 3.132 0.00173 **
## weekdaylunes 0.044432 0.010576 4.201 2.66e-05 ***
## weekdaymartes 0.047217 0.010515 4.490 7.11e-06 ***
## weekdaymiércoles 0.041380 0.010188 4.062 4.88e-05 ***
## weekdaysábado -0.002613 0.010750 -0.243 0.80796
## weekdayviernes 0.024764 0.010122 2.447 0.01443 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6499 on 105364 degrees of freedom
## Multiple R-squared: 0.863, Adjusted R-squared: 0.863
## F-statistic: 6.639e+04 on 10 and 105364 DF, p-value: < 2.2e-16
##confidence <- confint(lm)
lm$coefficients
## (Intercept) trip_distance fare_amount trip_timelength
## 0.127108597 -0.006632728 0.197550635 0.002625397
## intratripyes weekdayjueves weekdaylunes weekdaymartes
## 0.212187749 0.031655499 0.044432402 0.047217235
## weekdaymiércoles weekdaysábado weekdayviernes
## 0.041379968 -0.002612863 0.024764283
The training results show good performance for the linear regression. We can check the importance of the different variables.
Test the model
xtest <- read.csv("data/xtest.csv")
xtest <- xtest[,-3] ## remove the total amount
ytest <- read.csv("data/ytest.csv")
predictions <- predict(lm, xtest, interval="prediction")
results <- cbind(ytest, predictions)
summary(lm)
##
## Call:
## lm(formula = tip_amount ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6836 -0.1019 0.1087 0.1904 5.3176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.127109 0.011419 11.131 < 2e-16 ***
## trip_distance -0.006633 0.006066 -1.093 0.27421
## fare_amount 0.197551 0.003091 63.918 < 2e-16 ***
## trip_timelength 0.002625 0.001138 2.308 0.02101 *
## intratripyes 0.212188 0.007636 27.788 < 2e-16 ***
## weekdayjueves 0.031655 0.010106 3.132 0.00173 **
## weekdaylunes 0.044432 0.010576 4.201 2.66e-05 ***
## weekdaymartes 0.047217 0.010515 4.490 7.11e-06 ***
## weekdaymiércoles 0.041380 0.010188 4.062 4.88e-05 ***
## weekdaysábado -0.002613 0.010750 -0.243 0.80796
## weekdayviernes 0.024764 0.010122 2.447 0.01443 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6499 on 105364 degrees of freedom
## Multiple R-squared: 0.863, Adjusted R-squared: 0.863
## F-statistic: 6.639e+04 on 10 and 105364 DF, p-value: < 2.2e-16
The model performs similarly on the test data. We can check the predictions, and we can also see that the errors are normally distributed.
## Normal distribution of residuals.
histogram(lm$residuals, par.settings=myTheme)
We can use the RMSE as an indicator of the model performance:
## RMSE
RMSE_lm <- sqrt(mean((results$x - results$fit)^2))
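The same RMSE formula can be wrapped in a small helper so it can be reused for the next model as well; a sketch checked on made-up vectors:

```r
# Root mean squared error between observed and predicted values
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# e.g. predictions off by 2 on one of three points give sqrt(4/3)
err <- rmse(c(1, 2, 3), c(1, 2, 5))
```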
We now fit a random forest, to check whether it improves on the predictions of the multilinear model.
Run the model
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
# It takes long
rf_model <- randomForest(tip_amount ~., data = train, importance = TRUE, ntree=300)
rf_model
##
## Call:
## randomForest(formula = tip_amount ~ ., data = train, importance = TRUE, ntree = 300)
## Type of random forest: regression
## Number of trees: 300
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.4750178
## % Var explained: 84.6
importance(rf_model)[,1] ## show the importance of the variables
## trip_distance fare_amount trip_timelength intratrip weekday
## 19.510584 20.783364 19.666660 11.946406 2.511162
Test the model
rf_predict <- predict(rf_model, xtest)
results <- cbind(ytest, rf_predict)
plot(results$x~results$rf_predict, xlab='predictions', ylab='test', cex=0.5)
abline(0,1, col='red')
We can compute the error and RMSE as well.
## RMSE
RMSE_rf <- sqrt(mean((results$x-results$rf_predict)^2))
RMSE_rf
## [1] 0.7101
The relative error can also be represented. Most of the time the error is small, but a few predictions are far from the real value.
err <- round((abs(results$x-results$rf_predict)/results$x)*100, digits=1)
histogram(err, par.settings=myTheme)